ABSTRACT

Understanding the immense and intricate topological structure of the World Wide Web (WWW) is a major scientific and technological challenge. This has been tackled recently by characterizing the properties of its representative graphs, in which vertices and directed edges are identified with Web pages and hyperlinks, respectively. Data gathered in large-scale crawls have been analyzed by several groups, resulting in a general picture of the WWW that encompasses many of the complex properties typical of rapidly evolving networks [5, 10, 22, 14]. In this paper, we report a detailed statistical analysis of the topological properties of four different WWW graphs obtained with different crawlers. We find that, despite the very large size of the samples, the statistical measures characterizing these graphs differ quantitatively, and in some cases qualitatively, depending on the domain analyzed and the crawl used for gathering the data. This raises the issue of sampling biases [20, 4, 32] and structural differences among Web crawls that might induce properties not representative of the actual global underlying graph. In order to provide a more accurate characterization of the Web graph and to identify observables that clearly discriminate with respect to the sampling process, we study the behavior of degree-degree correlation functions and the statistics of reciprocal connections. The latter appears to capture the relevant correlations of the WWW graph and to carry most of the topological information of the Web. The analysis of this quantity is also of major interest in relation to the navigability and searchability of the Web.
1. INTRODUCTION

The World Wide Web (WWW) has grown at an unprecedented pace. While it is not possible to provide a precise estimate of the WWW size in terms of pages, a recent study [19], which used Web searches in 75 different languages, determined that there were over 11.5 billion Web pages in the publicly indexable Web [24, 25] at the end of January 2005. Furthermore, Web growth lacks any regulation or physical constraint (contrary to what happens with the physical Internet infrastructure [30]), with new documents being added or becoming obsolete very quickly.

A fundamental step in decoding and understanding the organization of the WWW consists in the experimental study of the WWW graph structure, in which vertices and directed edges are identified with Web pages and hyperlinks, respectively. These studies are based on crawlers that explore the WWW connectivity by following the links on each discovered page, thus reconstructing the topological properties of the representative graph. Several studies based on those graphs have been performed in order to reveal the large-scale topological properties of the WWW. Distributions of in-degrees and out-degrees have been found to exhibit heavy tails, and the macroscopic architecture of connected components has revealed a rich structural organization, the so-called bow-tie structure [23, 5, 10, 14]. Reciprocal links and transitive relations regarding thematic communities [17] have attracted attention as well, giving rise to a generally accepted picture of the topological structure of the WWW.

While the importance of these studies is indisputable, the dynamical nature of the Web and its huge size make the process of compressing, ranking, indexing or mining the Web very difficult. Indeed, even the largest-scale Web crawlers cover only a small portion of the publicly available information. In other words, it has so far been impossible to achieve a complete, unbiased large-scale picture of the Web. On the other hand, the very large sizes of the gathered data sets have led to the general belief that the structural and statistical properties observed in the WWW graphs are representative of the actual ones, thus leaving the study of possible sampling biases [20] almost untouched. In this respect, it is crucial, on the one hand, to understand clearly what information crawl engines provide and, on the other hand, to explore to what extent the Web properties we observe are biased by the specific characteristics of the crawls.

In this paper, we study four different data sets obtained in dif- 
ferent years with different crawls and for different domains of the 
WWW. Our main contributions are: 

• We provide a careful comparative analysis of the structural 
and statistical topological properties of the different Web graphs, 
making evident qualitative and quantitative differences across 
different samples. We look at higher order statistical indica- 
tors characterizing single and two-vertex correlations in or- 
der to provide a full account of the connectivity pattern and 
structural ordering of the Web graph. See Sections 4 and 5.

• We identify a novel and crucial topological element, the reciprocal link, which plays a key role in the organization of the WWW and accounts for most of the statistical correlations observed in Web graphs. Reciprocal links [18], also referred to in the literature as bidirectional links [8] or co-links [17], allow us to clearly discriminate among the statistical properties resulting from different crawls. Furthermore, the inspection of the subgraphs of reciprocally connected vertices provides structural information that might be crucial to assess how the underlying topology could affect the functionality [8] of the Web and/or processes running on it. Indeed, navigability and searchability are intimately related to the functionality of the WWW, and those properties strongly depend on the communication patterns among the constituent sites of the network. See Section 6.

2. RELATED WORK 

The first empirical topological studies of the Web as a directed graph focused on measuring the directed degree distributions $P(k_{in})$ and $P(k_{out})$, where the in-degree $k_{in}$ and the out-degree $k_{out}$ are defined as the number of incoming and outgoing links, respectively, connecting a page to its neighbors. The work by Kumar et al. [23] on a large crawl of about 40M nodes, and that by Barabasi and Albert [5] on a smaller set of over 0.3M nodes restricted to the domain of the University of Notre Dame, suggested a scale-free nature for the WWW, with power-law behaviors for both the in- and out-degree distributions.

Immediately afterwards, a more complete investigation was published by Broder et al. [10]. There, two sets from AltaVista crawls were analyzed, corresponding to two different months of 1999, May and October. The sets had over 200 million pages and 1.5 billion links. The authors reported detailed measurements of local and global properties of the Web graph. These covered, for instance, the degree distributions, corroborating earlier observations, and also the presence and organization of connected components, unfolding the so-called bow-tie structure of the Web. One of the most intriguing conclusions was that, from the analysis of those two sets, the observed structure of the Web was relatively insensitive to the particular large crawl used. In addition, the connectivity structure of the Web was resilient to the removal of a significant number of nodes.

Subsequently, further work [14] along the same lines was performed on a large 2001 data set of 200M pages and about 1.4 billion edges made available by the WebBase project at Stanford (see the next section for references and a project description). In this work, new measures were introduced along with the standard statistical observables, and the results were compared with those presented in the work by Broder et al. One of the reported differences is the deviation of the out-degree distribution from power-law behavior.

On the other hand, the question of whether subsets of the Web display the same characteristics as the Web at large has been discussed by a number of authors. Dill et al. [13] found self-similarity within thematically unified subgraphs extracted from a single crawl of 60M pages gathered in October 2000. In contrast, the different components of the bow-tie decomposition have been found to lack self-similarity in their inner structure when compared to the whole graph [15].

3. DATA SETS 

To gain some insight into how the crawling strategy affects observations, and into the existence of observable unbiased properties, we have analyzed and compared four sets of data corresponding to different years, from 2001 to 2004, and different domains: general crawls and the .uk and .it domains. The sets have been gathered within two different projects, the WebBase project and the WebGraph project, each using its own Web crawler, WebVac and UbiCrawler respectively. The WebBase Project is a World Wide Web repository built as part of the Stanford Digital Libraries Project by the Stanford University InfoLab.¹ The Stanford WebBase project² [21] is investigating various issues in crawling, storage, indexing, and querying of large collections of Web pages. The project aims to build the necessary infrastructure to facilitate the development and testing of new algorithms for clustering, searching, mining, and classification of Web content. The Stanford WebBase has been collected by the spider WebVac [11, 3] and makes available a Web repository with access to general crawls, such as the ones used in this research, or specific domain crawls restricted, for instance, to universities or institutions. The WebGraph Project³ is being developed by the Laboratory for Web Algorithmics⁴ (LAW) at the University of Milano and analyzes data obtained by its own crawler, UbiCrawler⁵ [9], designed to achieve high scalability and to be tolerant to failures.

The above projects make several data sets publicly available to researchers. We analyze four samples ranging from 2001 to 2004. The WebBase general crawl of 2001 (WBGC01) and the WebBase general crawl of 2003 (WBGC03)⁶ have been collected by the WebBase project in general crawls using the WebVac spider. The remaining two sets, collected by the UbiCrawler project, the WebGraph .uk domain of 2002 (WGUK02)⁷ and the WebGraph .it domain of 2004 (WGIT04)⁸, are restricted to the domains .uk and .it, respectively. Note that the two domain crawls present an interesting difference. Pages in the .uk domain have a higher probability of pointing to pages outside the domain, because English is the official language of other influential countries, such as the USA, and because of its widespread use. The links in the Italian .it domain, by contrast, may be much more endogenous, which could potentially have a strong effect on the Web description derived from the data.

We have cleaned the four sets by disregarding multiple links between the same pages and self-connections. In Table 1 we present a summary of the size in vertices and directed edges of the four sets analyzed in this paper.

Table 1: Number of nodes and edges of the networks considered, after removing multiple links and self-connections.

Data set    WBGC01       WGUK02       WBGC03        WGIT04
# nodes     80571247     18520486     49296313      41291594
# links     752527660    292243663    1185396953    1135718909

All the following measures have been carried out using Matlab code.⁹

¹ http://www-db.stanford.edu/
² http://dbpubs.stanford.edu:8091/~testbed/doc2/WebBase/
³ http://webgraph.dsi.unimi.it/
⁴ http://law.dsi.unimi.it/
⁵ http://ubi.iit.cnr.it/projects/ubicrawler/
⁶ ftp://db.stanford.edu/pub/webbase/
⁷ http://webdata.iit.cnr.it/united_kingdom-2002/
⁸ http://webdata.iit.cnr.it/italy-2004/
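The cleaning step described above can be illustrated with a minimal sketch. It is written in Python rather than the Matlab used for the original measurements, and the function name `clean_edges` and the toy edge list are hypothetical, introduced only for illustration.

```python
def clean_edges(edges):
    """Drop self-connections and collapse repeated (source, target) pairs.

    The first occurrence of each directed link is kept, in input order.
    """
    seen = set()
    cleaned = []
    for source, target in edges:
        if source == target:
            continue  # self-connection
        if (source, target) in seen:
            continue  # multiple link between the same ordered pair of pages
        seen.add((source, target))
        cleaned.append((source, target))
    return cleaned


# Toy example: one duplicated link and one self-connection.
raw = [(1, 2), (1, 2), (2, 2), (2, 1), (3, 1)]
cleaned = clean_edges(raw)
nodes = {page for link in cleaned for page in link}
print(len(nodes), "nodes,", len(cleaned), "links")
```

Note that the links (1, 2) and (2, 1) are both kept: they are distinct directed edges, and together they form a reciprocal connection.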

4. STRUCTURAL PROPERTIES 

Data gathered in large-scale crawls [23, 5, 6, 10, 17, 14] have
uncovered the presence of a complex architecture underlying the structure of the Web graph. A widespread feature is the small-world property. Despite the huge size of the Web, the average number of URL links that must be followed to navigate from one document to another, technically the average shortest path length, seems to be very small compared to the value for a regular lattice of comparable size, and it seems to grow very slowly with the system size, at a logarithmic pace [2, 10]. Another important result is that the WWW exhibits a power-law relationship between the frequency of vertices and their degree, defined as the number of directed edges linking each vertex to its neighbors. This last feature is the signature of a very complex and heterogeneous topology with statistical fluctuations extending over many length scales [2, 5, 23]. Finally, a fascinating macroscopic description of the Web has been provided by the study of the connected components, taking into account the directed nature of the Web graph [10]. In the following, we perform a careful comparative analysis of the four Web crawls described in the previous section. This will allow us to critically examine the stability of the various results as a function of the crawl and to discuss which properties appear to be genuine features of the global Web graph.

4.1 Sizes of connected components 

The directed nature of the Web brings out a complex structure of connected components [30, 16] that has been captured in the famous bow-tie architecture highlighted in the study presented in [10]. If we disregard the directedness of links, the weakly connected component of the graph is made up of all pages belonging to the giant component of the corresponding undirected graph. The undirected component becomes internally structured when the directed nature of the connections is considered. The most important of these new internal components is called the strongly connected component (SCC), which includes all pages mutually connected by a directed path. The other two relevant components are the in-component (IN) and the out-component (OUT). The first is formed by the vertices from which it is possible to reach the SCC by means of a directed path. The second refers to the set of vertices that can be reached from the SCC by means of a directed path. Finally, other secondary structures can also be present, such as tendrils, which contain pages that cannot reach the SCC and cannot be reached from it, or tubes, which directly connect the IN and OUT components without crossing the SCC. This complex composition is usually called the bow-tie structure because of the typical shape assumed by the figure sketching the relative size of each component (see Fig. 1). It is

⁹ Available upon request.



Table 2: Sizes of the SCC, IN and OUT components and their sum MAIN = SCC + IN + OUT. Notice that MAIN does not contain either tendrils or tubes, so that it differs from the weakly connected component. Values are shown as a percentage of the total number of nodes.

Data set    WBGC01    WGUK02    WBGC03    WGIT04
IN          17.24     1.69      2.28      0.03
SCC         56.46     65.28     85.87     72.30
OUT         17.94     31.88     11.26     27.64
MAIN        91.64     98.85     99.41     99.98



Figure 1: Graphical representation of the sizes of the global components reported in Table 2. The area of each component is proportional to its actual size, so that the relative sizes of the components in the figure account for the actual relative sizes of the Web graphs. (Panels: WBGC01, WGUK02, WBGC03, WGIT04. Legend: IN, SCC, OUT.)



clear that such a component structure is extremely relevant in the 
discussion of the functionalities of the Web. For instance, the rel- 
ative sizes of the SCC and the IN and OUT components give us 
information about the probabilities of returning to an original page 
after exploration, or the size of the accessible Web once a starting 
page has been selected. The size of the SCC is of particular im- 
portance, since it constitutes the subset of reversible and complete 
access navigability. When one starts to surf the Web from the IN 
component, it is very likely that after a while one ends up in the 
SCC, and maybe eventually in the OUT component, but can never 
go back to the original point. Once in the OUT component, one can 
never go back to the other main components. But within the SCC, 
all nodes are reachable and can be revisited. 
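The component definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `reachable` and `bow_tie` and the toy graph are hypothetical, and the SCC search used here (intersecting forward and backward reachability from one seed at a time) is a simple stand-in chosen for readability, not the procedure used for the measurements and not suitable for graphs of the size studied in this paper.

```python
from collections import defaultdict, deque


def reachable(start_nodes, adjacency):
    """Return all nodes reachable from start_nodes by breadth-first search."""
    seen = set(start_nodes)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bow_tie(edges):
    """Split a directed graph into the largest SCC, IN, OUT and the rest."""
    succ, pred = defaultdict(set), defaultdict(set)
    nodes = set()
    for source, target in edges:
        succ[source].add(target)
        pred[target].add(source)
        nodes.update((source, target))

    # The SCC of a seed is the set of nodes it can reach AND that can reach it.
    remaining, scc = set(nodes), set()
    while remaining:
        seed = next(iter(remaining))
        component = reachable([seed], succ) & reachable([seed], pred)
        remaining -= component
        if len(component) > len(scc):
            scc = component

    in_part = reachable(scc, pred) - scc   # can reach the SCC
    out_part = reachable(scc, succ) - scc  # can be reached from the SCC
    other = nodes - scc - in_part - out_part  # tendrils, tubes, disconnected
    return scc, in_part, out_part, other


# Toy graph: a -> SCC {b, c, d} -> e, a tendril (a -> f) and a tube (a -> g -> e).
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "b"),
         ("d", "e"), ("a", "f"), ("a", "g"), ("g", "e")]
scc, in_part, out_part, other = bow_tie(edges)
print(sorted(scc), sorted(in_part), sorted(out_part), sorted(other))
```

In the toy graph, node g lies on a path from IN to OUT that bypasses the SCC, so it is counted neither in MAIN = SCC + IN + OUT nor in any of its three parts, in line with the remark in the caption of Table 2.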

We summarize the values for the sizes of the components of the four data sets in Table 2. The figures for the domain crawls are in agreement with those reported in [15], where the same .uk and .it sets were also examined. The analysis of the four data sets considered in the present study shows a noticeable variability of the basic component structure of the resulting graph. In particular, the IN component is the most unstable feature: it ranges from accounting for about 17% of the total structure (WBGC01) to being practically absent (WGIT04). This variability can probably be ascribed to the different crawling strategies and to the fact that each of them may use different starting points. Moreover, crawlers perform a directed exploration, in the sense that they follow outgoing hyperlinks to reach the pages pointed to, but cannot navigate backwards using incoming hyperlinks. This implies that the exploration of the IN component is strongly biased by the initial conditions used to
Figure 2: Distributions of incoming links. In the shadowed regions all the functions decay as a power-law with exponents given in Table 3.



start the crawl. Variations are, however, not limited to the IN component. The relative sizes of the SCC and the OUT component also vary from sample to sample, by a factor close to three in the case of the OUT component. Finally, notice that the sizes of the IN and OUT components of the WBGC01 set are quite symmetric, as was also found in [10], where the values reported for the sizes of the IN, SCC and OUT components of the AltaVista crawl were 21.3%, 27.7% and 21.2%, respectively. In summary, it is evident from this analysis that the structure of Web graphs is strongly dependent on the crawler strategies.

4.2 Degree distributions 

A major feature of interest found in Web graphs is the presence of a highly heterogeneous topology, with degree distributions characterized by wide variability and heavy tails [2, 5, 23]. The degree distribution $P(k)$ for undirected networks is defined as the probability that a node is connected to $k$ other nodes. For directed networks, this function splits into two separate functions, the in-degree distribution $P(k_{in})$ and the out-degree distribution $P(k_{out})$, which are measured separately as the probabilities of having $k_{in}$ incoming links and $k_{out}$ outgoing links, respectively. In Figs. 2 and 3 we report the behavior of the in-degree and out-degree distributions. These distributions, as for most real-world networks, are found to be very different from the degree distribution of a random graph or an ordered lattice. They are both skewed and span several orders of magnitude in degree values. The in-degree distribution exhibits a heavy-tailed form approximated by a power-law behavior $P(k_{in}) \sim k_{in}^{-\gamma_{in}}$, generally spanning 3 to 4 orders of magnitude. In Figure 2 we show the region considered in the evaluation of the exponent, obtained by a maximum likelihood algorithm for discrete distributions. The in-degree distributions also exhibit a noisy tail that cannot be well fitted with a specific analytic form. Yet it strengthens the evidence for the heavy-tailed character of $P(k_{in})$.
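As an illustration of the definitions above, the following Python sketch estimates $P(k_{in})$ and $P(k_{out})$ from a list of directed links. The function name `degree_distributions` and the toy edge list are hypothetical; the sketch assumes the links have already been cleaned of duplicates and self-connections, and it does not attempt the maximum likelihood fit of the exponent.

```python
from collections import Counter


def degree_distributions(edges):
    """Return (P_in, P_out): fraction of nodes having each in-/out-degree."""
    in_degree, out_degree = Counter(), Counter()
    nodes = set()
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
        nodes.update((source, target))

    n = len(nodes)
    # Nodes missing from a Counter have degree 0 and are counted as such.
    in_hist = Counter(in_degree[v] for v in nodes)
    out_hist = Counter(out_degree[v] for v in nodes)
    p_in = {k: count / n for k, count in sorted(in_hist.items())}
    p_out = {k: count / n for k, count in sorted(out_hist.items())}
    return p_in, p_out


edges = [(1, 2), (1, 3), (2, 3), (3, 1)]
p_in, p_out = degree_distributions(edges)
print("P(k_in): ", p_in)
print("P(k_out):", p_out)
```

Each distribution sums to one over the observed degree values, since every node contributes exactly one in-degree and one out-degree.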

A different situation is faced in the case of the out-degree distribution $P(k_{out})$. In this case, a clear exponential cut-off is observed, and the range of degree values is 2 to 4 orders of magnitude smaller than that found for the in-degree distribution. The origin of the cut-off can be explained by the different nature of the in-degree and out-degree evolution. The in-degree of a vertex is the sum of


Table 3: Main statistical properties of the analyzed sets: average degree $\langle k \rangle$, maximum degree $k^{max}$, standard deviation $\sigma$, heterogeneity parameter $\kappa$, and maximum likelihood estimate of the exponent of the power-law in-degree distribution $\gamma_{in}$ (precision error ±0.1). All values are provided for in- and out-degrees and for the four data sets. The symbol $\infty$ for $\gamma_{out}$ means that the out-degree distributions decay faster than a power-law.

Data set                     WBGC01     WGUK02     WBGC03     WGIT04
$\langle k_{in} \rangle$     9.3        15.8       24.1       27.5
$k_{in}^{max}$               788632     194942     378875     1326744
$\sigma_{in}$                200.2      143.3      421.6      881.4
$\kappa_{in}$                4298.6     1317.5     7414.9     28269.9
$\gamma_{in}$                1.9        1.7        2.2        1.6

Data set                     WBGC01     WGUK02     WBGC03     WGIT04
$\langle k_{out} \rangle$    9.3        15.8       24.1       27.5
$k_{out}^{max}$              552        2449       629        9964
$\sigma_{out}$               13.1       27.4       29.5       67.1
$\kappa_{out}$               27.7       63.4       60.3       191.0
$\gamma_{out}$               $\infty$   $\infty$   $\infty$   $\infty$


all the hyperlinks incoming from all the Web pages in the WWW. In principle, then, there is no limit to the number of incoming hyperlinks, which is determined only by the popularity of the Web page itself. By contrast, the out-degree is determined by the number of hyperlinks present in the page, which are controlled by Web administrators. For evident reasons (clarity, handling, data storage) it is very unlikely to find an excessively large number of hyperlinks in a given page. This represents a sort of finite capacity [26] for the formation of outgoing hyperlinks that might naturally lead to a finite cut-off in the out-degree distribution.

The heavy-tailed behavior of the in-degree distribution implies that there is a statistically significant probability that a vertex has a very large number of connections compared to the average degree $\langle k_{in} \rangle$. In addition, the extremely large value of $\langle k_{in}^2 \rangle$, and therefore of the variance $\sigma_{in}^2 = \langle k_{in}^2 \rangle - \langle k_{in} \rangle^2$, signals the extreme heterogeneity of the connectivity pattern, since it implies that statistical fluctuations are virtually unbounded, and tells us that the average degree is not the typical degree value in the system, i.e., we have scale-free distributions. The heavy-tailed nature of the degree distribution also has important consequences for the dynamics of processes taking place on top of these networks. Indeed, recent studies about network resilience to removal of vertices [12] and spreading [29] have shown that the relevant parameter for these phenomena is the ratio between the first two moments of the degree distribution, $\kappa = \langle k^2 \rangle / \langle k \rangle$. If $\kappa \gg 1$, the network manifests some properties that are not observed for networks with exponentially decaying degree distributions. In the case of directed networks, this heterogeneity parameter has to be defined separately for in- and out-degrees as $\kappa_{in} = \langle k_{in}^2 \rangle / \langle k_{in} \rangle$ and $\kappa_{out} = \langle k_{out}^2 \rangle / \langle k_{out} \rangle$,¹⁰ since it could happen that the network is heterogeneous with respect to one of the degrees but not to the other.¹¹ In Table 3 we provide these values for the empirical graphs along with a summary of the numerical properties of the probability distributions analyzed so


¹⁰ Notice that for any directed graph $\langle k_{in} \rangle = \langle k_{out} \rangle$.

¹¹ In addition, a third parameter can be defined which accounts for the effect of the crossed one-point correlations, $\kappa_{in,out} = \langle k_{in} k_{out} \rangle / \langle k_{in} \rangle$.




Figure 3: Distributions of outgoing links. For visualization purposes, we use cumulative distributions defined as $P_c(k_{out}) = \sum_{k'_{out} > k_{out}} P(k'_{out})$. The inset shows the same curves in a linear-log scale.



far. The heavy-tailed behavior is especially evident when comparing the heterogeneity parameters $\kappa$ and their wide range of variation. A marked difference is observed for the out-degree distributions, where the variance and heterogeneity parameters indicate a limited variability of the function $P(k_{out})$. From the exponents reported for the in-degree distribution, it is evident that the fits to a power-law form can yield slightly different results, depending on the data set analyzed. These variations could signal a slightly different structure of the Web graph depending on the domain crawled, or the possible presence of statistical biases due to the crawling strategy. It is interesting to notice that a similar variability is encountered in studies of the power-law behavior of Web samples restricted to specific thematic groups [31]. Another oddity that has to be pointed out is the fact that the general crawls WBGC01 and WBGC03 exhibit a much smaller cut-off of the out-degree distribution than that observed in the two domain crawls. This is somewhat counterintuitive given the larger sizes of the general crawls. It might hint at the presence of a bias in the way hyperlinks are explored by different crawlers, again providing evidence for the presence of sampling biases that affect the observed statistical properties of Web graphs.
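The quantities summarized in Table 3 follow directly from the first two moments of a degree sequence. The Python sketch below is a minimal illustration, with a hypothetical function name `degree_moments` and invented toy sequences; it also makes explicit the identity $\kappa = \langle k \rangle + \sigma^2 / \langle k \rangle$, which follows from $\sigma^2 = \langle k^2 \rangle - \langle k \rangle^2$.

```python
from math import sqrt


def degree_moments(degrees):
    """Average, maximum, standard deviation and heterogeneity parameter."""
    n = len(degrees)
    mean = sum(degrees) / n
    second_moment = sum(k * k for k in degrees) / n
    # Guard against tiny negative values caused by floating-point rounding.
    variance = max(second_moment - mean ** 2, 0.0)
    return {
        "mean": mean,
        "max": max(degrees),
        "sigma": sqrt(variance),
        "kappa": second_moment / mean,  # <k^2> / <k>
    }


# Two toy sequences with the same average degree but different spread.
heterogeneous = degree_moments([0, 1, 1, 2, 6])
homogeneous = degree_moments([2, 2, 2, 2, 2])
print(heterogeneous)
print(homogeneous)
```

Both sequences have average degree 2, but only the first has $\kappa$ well above $\langle k \rangle$, which is the signature of heterogeneity discussed above. As a consistency check on Table 3, the WBGC01 out-degree values $\langle k_{out} \rangle = 9.3$ and $\sigma_{out} = 13.1$ give $9.3 + 13.1^2 / 9.3 \approx 27.75$, in agreement with the tabulated $\kappa_{out} = 27.7$ within rounding.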

5. DEGREE CORRELATIONS 

As an initial discriminant of structural ordering, attention has been focused on the degree distribution of the networks. This function is, however, only one of the many statistics characterizing the structural and hierarchical ordering of a network; a full account of the connectivity pattern calls for the detailed study of degree correlations. Along these lines, for instance, it is possible to provide a quantitative study of the mixing properties of networks through appropriate projections of the degree-degree joint probability distribution. This allows the distinction between assortative networks, in which large-degree nodes preferentially attach to large-degree nodes, and disassortative networks, showing the opposite tendency [27]. These structural properties are the signature of specific ordering principles.

5.1 Single vertex degree correlations 

First, we examine local one-point degree correlations for indi- 
vidual nodes, in order to understand if there is a relation between 
the number of incoming and outgoing links in single pages. Since 
most of the analyzed degree distributions are heavy-tailed, fluctu- 
ations are extremely large so that the linear correlation coefficient 
is not well defined for those cases. Instead, we provide the crossed 



Table 4: Crossed in-degree out-degree correlations for individual nodes, normalized by the uncorrelated values.

Data set                                                                              WBGC01    WGUK02    WBGC03    WGIT04
$\langle k_{in} k_{out} \rangle / (\langle k_{in} \rangle \langle k_{out} \rangle)$   2.8       3.1       1.6       5.6




Figure 4: Normalized average out-degree as a function of the 
in-degree for the four different data sets. 



one-point correlations, $\langle k_{in} k_{out} \rangle$, normalized by the corresponding uncorrelated value, $\langle k_{in} \rangle \langle k_{out} \rangle$. We also report the function

$$\langle k_{out}(k_{in}) \rangle = \frac{1}{N_{k_{in}}} \sum_{i \in \Upsilon(k_{in})} k_{out,i}, \qquad (1)$$

which measures the average out-degree of nodes as a function of their in-degree. $N_{k_{in}}$ stands for the number of nodes with in-degree $k_{in}$ and $k_{out,i}$ is the out-degree of node $i$. The notation $i \in \Upsilon(k_{in})$ indicates that the summation has to be performed over the set of nodes of in-degree $k_{in}$, denoted by $\Upsilon(k_{in})$. The results can be found in Table 4 and in Fig. 4.
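The two one-point measures just described can be sketched as follows in Python. The function name `one_point_correlations` and the toy links are hypothetical, and the second quantity is returned without any further normalization.

```python
from collections import Counter, defaultdict


def one_point_correlations(edges):
    """Return (<k_in k_out> / (<k_in><k_out>), {k_in: average out-degree})."""
    in_degree, out_degree = Counter(), Counter()
    nodes = set()
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
        nodes.update((source, target))

    n = len(nodes)
    mean_in = sum(in_degree[v] for v in nodes) / n
    mean_out = sum(out_degree[v] for v in nodes) / n
    crossed = sum(in_degree[v] * out_degree[v] for v in nodes) / n
    normalized = crossed / (mean_in * mean_out)

    # Group nodes by in-degree and average their out-degrees, as in Eq. (1).
    out_by_kin = defaultdict(list)
    for v in nodes:
        out_by_kin[in_degree[v]].append(out_degree[v])
    average_out = {k: sum(vals) / len(vals) for k, vals in sorted(out_by_kin.items())}
    return normalized, average_out


edges = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 3)]
normalized, average_out = one_point_correlations(edges)
print(normalized)
print(average_out)
```

A normalized value above 1 indicates that in- and out-degrees of single pages are positively correlated, and a value of 1 corresponds to the uncorrelated case.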

A significant positive correlation between the in-degrees and the 
out-degrees of single nodes is found for all the sets. That means 
that more popular pages tend to point to a higher number of other 
pages. This positive correlation is found to be true for a range of 
in-degrees that spans from kin = 1 to kin ~ 10^ ~ 10"', depend- 
ing on the specific set. Beyond this point no noticeable correlation 
is observed (see Fig. 4). The set for the Italian domain is more
noisy, but this pattern appears to be independent of the crawl used 
to gather the data and, thus, it seems to be a genuine feature of the 
Web. 

5.2 Two-vertex degree correlations 

Another important source of information about the structural organization of the network lies in the correlations of the degrees of neighboring vertices. These correlations can be probed in undirected networks by inspecting the average degree of the nearest neighbors of a vertex $i$, where nearest neighbors refers to the set of vertices at a hop distance equal to 1,

$$k_{nn,i} = \frac{1}{k_i} \sum_{j \in \mathcal{V}(i)} k_j. \qquad (2)$$

The sum runs over the nearest neighbor vertices of each vertex $i$, gathered in the set $\mathcal{V}(i)$. From this quantity, a convenient measure is obtained by averaging over degree classes to obtain the average degree of the nearest neighbors for vertices of degree $k$, defined



Figure 5: Graphical sketch illustrating the degree-degree correlation functions defined in Section 5.2. We focus on a single node (the central node in the figures) with in-degree $k_{in} = 2$ and out-degree $k_{out} = 3$. In a) the average in-degree of its in-neighbors is computed taking into account the incoming arrows inside the grey area. The function $k_{in,nn}(k_{in})$ is then the average of this quantity over all nodes with the same in-degree. The rest of the functions are defined in a similar way, as highlighted in b), c), and d). (Panel labels recoverable from the figure: out-degree of in-neighbors, in-degree of out-neighbors.)

Figure 6: Degree-degree correlations for the four different data sets. Explicit expressions for the quantitative definition of these functions can be found in Appendix A.

as

$$k_{nn}(k) = \frac{1}{N_k} \sum_{i \in \Upsilon(k)} k_{nn,i} = \sum_{k'} k' P(k'|k), \qquad (3)$$

where $N_k$ is the number of nodes with degree $k$, the notation $i \in \Upsilon(k)$ indicates that the summation has to be performed over the set of nodes of degree $k$, denoted by $\Upsilon(k)$, and $P(k'|k)$ quantifies the conditional probability that a vertex with degree $k$ is connected to a vertex with degree $k'$. This measure provides a sharp probe of the presence or absence of correlations. In the case of uncorrelated networks, the degrees of connected vertices are independent random quantities, so that $P(k'|k)$ is only a function of $k'$. In this case, $k_{nn}(k)$ does not depend on $k$ and equals $\kappa = \langle k^2 \rangle / \langle k \rangle$. Therefore, a function $k_{nn}(k)$ showing any explicit dependence on $k$ signals the presence of degree correlations in the system. Real networks usually tend to display one of two different patterns [27]. Assortative networks exhibit $k_{nn}(k)$ functions increasing with $k$, which denotes that vertices are preferentially connected to other vertices with similar degree. Examples of assortative behavior are typically found in many social structures. On the other hand, disassortative networks exhibit $k_{nn}(k)$ functions decreasing with $k$, which denotes that vertices are preferentially connected to other vertices with very different degree. Examples of disassortative behavior are typically found in several technological networks, as well as in communication and biological networks.
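For an undirected graph, the average nearest-neighbor degree per degree class can be computed as in the following Python sketch. The function name `knn_by_degree` and the star-graph example are hypothetical, chosen because a star is the simplest disassortative structure.

```python
from collections import defaultdict


def knn_by_degree(edges):
    """Average nearest-neighbor degree k_nn(k) for an undirected edge list."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-connections
            neighbors[a].add(b)
            neighbors[b].add(a)

    per_class = defaultdict(list)
    for i, nbrs in neighbors.items():
        k_i = len(nbrs)
        # Average degree of the nearest neighbors of vertex i.
        knn_i = sum(len(neighbors[j]) for j in nbrs) / k_i
        per_class[k_i].append(knn_i)

    # Average over all vertices in the same degree class.
    return {k: sum(vals) / len(vals) for k, vals in sorted(per_class.items())}


# Star graph: hub 0 connected to three leaves.
star = [(0, 1), (0, 2), (0, 3)]
knn = knn_by_degree(star)
print(knn)
```

In the star, vertices of degree 1 have a neighbor of degree 3 while the vertex of degree 3 has neighbors of degree 1, so the function decreases with $k$, the disassortative pattern described above.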

In the case of the WWW, the study of the degree-degree corre- 
lation functions is naturally affected by the directed nature of the 
graph. In [TJ, a set of directed degree-degree correlation functions 



was defined considering that, in this case, the neighbors can be re- 
stricted to those connected by a certain type of directed link, either 
incoming or outgoing. For the WWW, we study the most signif- 
icant distributions, taking into account that we can partition the 
neighborhood of each single node i into neighboring nodes con- 
nected to it by incoming links and neighboring nodes connected 
to it by outgoing links. A first correlation indicator, $k_{in,nn}(k_{in})$,
is defined as the normalized average in-degree of the neighbors of
nodes of in-degree $k_{in}$, when those neighboring nodes are found
following incoming links of the original node (see Fig. 5a). If we
measure the popularity of Web pages in terms of the number of
pages pointing to them, this function quantifies the average popularity
of pages pointing to pages with a certain popularity. The
exact definition is given in Appendix A along with the expression
for the normalization factor. The rest of the correlation functions,
$k_{out,nn}(k_{in})$, $k_{out,nn}(k_{out})$, and $k_{in,nn}(k_{out})$, can be defined in an
analogous manner. Each plot in Fig. 6 shows these correlation functions
for the four data sets analyzed in this paper. Remarkably, only
one of the functions shows an increasing pattern denoting the pres- 
ence of assortative correlations for the four data sets. The average 
out-degree of neighbors of nodes of high out-degree is also high, 
so that the average number of references is high in pages pointed to
by pages with a high number of references. In all other cases, very
mild correlations or a complete lack of correlation is observed. This is
somewhat surprising since, from the observed similarities in the correlation
patterns, one cannot infer the differences in the structural properties
observed in Sec. 4.1 for the different Web graphs.
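
The first directed correlation function described above can be sketched as follows. This is an illustrative, unnormalized version: it assumes the graph is given as a list of (source, target) hyperlinks, it ignores self-loops and duplicate links, and it omits the normalization factor defined in Appendix A. The function name is hypothetical.

```python
from collections import defaultdict


def kin_nn_of_kin(edges):
    """Return {k_in: average in-degree of the pages pointing to pages of in-degree k_in}."""
    edge_set = {(u, v) for u, v in edges if u != v}
    predecessors = defaultdict(set)  # predecessors[v] = pages linking to v
    for u, v in edge_set:
        predecessors[v].add(u)

    def in_degree(w):
        # .get avoids creating empty entries for pages with no incoming links.
        return len(predecessors.get(w, ()))

    total = defaultdict(float)
    count = defaultdict(int)
    for page, sources in predecessors.items():
        k = len(sources)
        # Mean in-degree of the neighbors reached by following incoming links.
        total[k] += sum(in_degree(w) for w in sources) / k
        count[k] += 1
    return {k: total[k] / count[k] for k in total}
```

The other three functions follow the same pattern, swapping the degree measured on the neighbors or the type of link followed.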

6. THE ROLE OF RECIPROCAL LINKS 

Although the Web is a directed network, many of its pages point to
each other. A pair of pages pointing to each other corresponds
to the presence of a reciprocal link that can be considered as undirected.
These reciprocal connections play an important role, and in
this section we introduce and investigate reciprocal links as crucial
elements in the understanding of the WWW. To this end, we will
distinguish between incoming, outgoing, and reciprocal links, where
incoming and outgoing links do not include the ones taking part in
reciprocal connections and are referred to as non-reciprocal. This





Figure 7: Probability distributions of reciprocal links. The 
inset shows the distributions for the two general crawls in a 
linear-log scale. 

Table 5: Main statistical properties of the reciprocal subgraphs:
average degree $\langle q_r \rangle$, maximum degree $q_r^{max}$, standard
deviation $\sigma_r$, heterogeneity parameter $\kappa_r$, and maximum likelihood
estimate of the exponent of the power-law in-degree distribution
$\gamma_r$ (precision error ±0.1). The symbol $\infty$ means that
the distribution decays faster than a power-law.
allows us to consider reciprocal and non-reciprocal connections as
separate, well-defined, independent entities and provides a statistical
analysis able to capture additional information about the Web
structure and the sampling biases that may be present in different
data sets.
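
The decomposition described above can be sketched in a few lines. This is a minimal illustration, assuming the crawl is available as a list of (source, target) hyperlinks; dropping self-loops and duplicate links is an assumption of the sketch, and the function name is hypothetical. For each page it returns the non-reciprocal in-degree, the non-reciprocal out-degree, and the reciprocal degree (written `q_in`, `q_out`, and `q_r` in the comments).

```python
from collections import defaultdict


def split_degrees(edges):
    """Return {page: (q_in, q_out, q_r)} for a directed graph of (source, target) links."""
    edge_set = {(u, v) for u, v in edges if u != v}  # drop duplicates and self-loops
    counts = defaultdict(lambda: [0, 0, 0])  # [q_in, q_out, q_r]
    for u, v in edge_set:
        if (v, u) in edge_set:
            # Reciprocal pair: the reverse link is visited separately,
            # so only the source is updated here.
            counts[u][2] += 1
        else:
            counts[u][1] += 1  # non-reciprocal outgoing link of u
            counts[v][0] += 1  # non-reciprocal incoming link of v
    return {page: tuple(c) for page, c in counts.items()}
```

A link counts as reciprocal only when both directions are present in the data, so a page that is the target of a reciprocal pair never has that link added to its non-reciprocal in-degree.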

6.1 Degree distributions 

To fix notation, in the following we denote the
non-reciprocal in-degree and out-degree of a given vertex $i$ by
$q_{in,i}$ and $q_{out,i}$, respectively. Analogously, the reciprocal degree
(r-degree) $q_{r,i}$ indicates the number of reciprocal connections to
neighboring vertices. While the degree distributions of non-reciprocal
links are extremely similar to those obtained for the global in- and
out-degree, the reciprocal degree distribution exhibits
strikingly different behavior depending on the crawl examined. In
particular, general crawls show a distribution $P(q_r)$ with an exponentially
fast decay, while the domain crawls have
a heavy-tailed distribution varying over three orders of magnitude
(see Fig. 7). In Table 5 we summarize the main properties of $P(q_r)$
for the various data sets. The values shown there also make clear
the mild fluctuations and heterogeneity of the
general crawl data sets. The evident differences in the reciprocal
degree distributions match the dissimilar component structure observed
in general and domain crawls. On the other hand, the origin
of the two different statistical behaviors has no clear explanation
and deserves further investigation. In particular, neither the
crawling strategies nor the possible features of specific Web
domains offer an easy explanation. Finally,
we emphasize once again the odd finding that general crawls
show reciprocal degree distribution cut-offs much smaller than


Table 6: Crossed non-reciprocal in-degree, out-degree, and r- 
degree correlations for individual nodes. 
those observed for domain crawls.

6.2 One-point degree correlations 

The distinction between reciprocal and non-reciprocal links in- 
duces a higher complexity even at the most local level. In this 
case, each node is characterized by three different quantities. Consequently,
we need to introduce three correlation measures, i.e.,
the average non-reciprocal out-degree as a function of the non-reciprocal
in-degree, $\langle q_{out}(q_{in}) \rangle$, and the average r-degree as a
function of the number of non-reciprocal incoming and outgoing
links, $\langle q_r(q_{in}) \rangle$ and $\langle q_r(q_{out}) \rangle$, respectively (see Fig. 8). A surprising
result is that, in this case, there is no clear correlation between
non-reciprocal in- and out-degrees, but there is a positive
correlation between the reciprocal degree and the non-reciprocal in-degree. So,
the positive correlation previously observed between in- and out-degrees
is just a consequence of this new correlation.
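
The three one-point measures can be sketched as conditional averages over the per-node triples. The following is an illustrative, unnormalized version (the curves in Fig. 8 are normalized); it assumes each node is already described by a triple of non-reciprocal in-degree, non-reciprocal out-degree, and r-degree, and the function names and dictionary keys are hypothetical.

```python
from collections import defaultdict


def conditional_average(pairs):
    """Given (x, y) pairs, return {x: mean of y over all pairs sharing that x}."""
    total = defaultdict(float)
    count = defaultdict(int)
    for x, y in pairs:
        total[x] += y
        count[x] += 1
    return {x: total[x] / count[x] for x in total}


def one_point_correlations(triples):
    """Compute the three one-point measures from (q_in, q_out, q_r) triples."""
    triples = list(triples)
    return {
        "q_out(q_in)": conditional_average((q_in, q_out) for q_in, q_out, _ in triples),
        "q_r(q_in)": conditional_average((q_in, q_r) for q_in, _, q_r in triples),
        "q_r(q_out)": conditional_average((q_out, q_r) for _, q_out, q_r in triples),
    }
```

A curve that stays flat as its argument grows indicates no correlation between the two quantities, while an increasing curve indicates a positive correlation.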

6.3 Degree-degree correlations 

The two-vertex correlation analysis presented in Section 5.2 can
be repeated for the non-reciprocal and reciprocal decomposition of
the network. Now, we have to differentiate reciprocal links and 
segregate the neighborhood of each single node i into neighbor- 
ing nodes connected to it by non-reciprocal incoming links, neigh- 
boring nodes connected to it by non-reciprocal outgoing links, and 
neighboring nodes connected to it by reciprocal links. The degree- 
degree correlation functions corresponding to the first two cases 
give similar results to the ones presented in the previous section 
and do not signal the presence of any relevant correlation pattern 
(not plotted). 

A very different picture is obtained when we measure correlations
following reciprocal connections. A strong positive correlation
is observed between the in-degrees of nodes connected by reciprocal
links. This is clearly visible in the upper left plot of Fig. 9,
which shows the normalized average non-reciprocal in-degree of
the neighbors of nodes of non-reciprocal in-degree $q_{in}$, when the
neighbors are found following reciprocal links, $q_{in,nn}(q_{in}|r)$. This
function shows a clear increase of two orders of magnitude as a
function of $q_{in}$, indicating an assortative correlation. The same
behavior is found between non-reciprocal out-degrees (lower right
plot of Fig. 9). Concerning the crossed correlations, we observe
again a positive correlation between the neighboring non-reciprocal
in-degree and the non-reciprocal out-degree but no noticeable correlation
in the opposite one, that is, the average non-reciprocal out-degree
of the reciprocal neighbors of a node is independent of the
non-reciprocal in-degree of that node (see lower left plot in Fig. 9).
In summary, the analysis of the two-vertex degree correlation
behavior indicates that most of the structural correlations of Web
graphs are found in vertices connected by reciprocal links. This
type of link therefore represents an element of particular interest
in that it expresses the ordering principles (beyond simple randomness)
at the basis of the Web structure.

6.4 The reciprocal subgraph 
Figure 8: One node correlations for the four different data
sets. The functions shown are the normalized average non-reciprocal
out-degree as a function of the non-reciprocal in-degree,
and the normalized average r-degree as a function of
the non-reciprocal in- and out-degrees.



Very interesting information is provided by the study of how reciprocal
links are structurally organized among themselves. If we look
at the subgraph formed by the vertices and the reciprocal links,
we can use the tools adopted for undirected graphs. A measure
of the two-vertex correlation function is therefore expressed by
$q_{r,nn}(q_r)$ (see Sec. 5.2), i.e., the standard measure of an undirected
network if we identify reciprocal links as undirected. As shown
in Fig. 10, this function shows a first decrease, for $q_r < 10$, followed
by a linear increase up to a critical value depending on the
crawler. At high reciprocal degrees, a cloud of points populates
the low r-degree region of the average nearest neighbor reciprocal
degree. This defines a bimodal pattern which indicates two
different behaviors. The cloud of low values can be interpreted as
a collection of star-like structures, with central hubs connected to
low degree nodes. This effect is probably due to the "home" button
in many Web pages that belong to a bigger site. The linear behavior
may have two different interpretations. The first one is that
the network is a tree in which high degree nodes are connected to
other high degree nodes. The second one is that the network forms
clique-like structures, that is, groups of pages pointing simultaneously
to each other. To discern which scenario is more appropriate
we inspect the local connectivity properties of reciprocally linked





Figure 9: Non-reciprocal degree-degree correlations for the
four different data sets.



vertices. Since we can treat the reciprocal subgraph as an undirected
one, we can probe the local interconnectedness by analyzing
the clustering coefficient defined as the fraction of interconnected
neighbors of $j$: $c_j = 2 n_{link} / (q_{r,j}(q_{r,j} - 1))$, where $n_{link}$ is
the number of reciprocal links between the $q_{r,j}$ reciprocal neighbors
of $j$. This quantity measures the density of interconnected
vertex triplets and it is therefore close to one in the case of a fully
interconnected neighborhood and zero in the case of a tree structure.
Global statistical information can be gathered by inspecting
the average clustering coefficient $c(q_r)$ restricted to classes of vertices
with reciprocal degree $q_r$. In the first scenario, $c(q_r)$ should
be very small and decreasing with the degree because of the tree-like
structure. In the second one, $c(q_r)$ should be significant and
independent of the degree. In Fig. 10 we show the function $c(q_r)$,
which exhibits a high and constant value followed by a cloud of
points with very low clustering coefficient at the same point where
the function $q_{r,nn}(q_r)$ also splits. This indicates that the organization
of the reciprocal subgraph is a set of star-like structures
combined with cliques, or communities, of highly interconnected
pages. Very interestingly, this pictorial characterization appears to
be the same in all Web graphs considered, pointing to a genuine
feature of the Web graph. The present analysis identifies in the reciprocal
subgraph an important element that might help in decoding
the structure of the WWW. Finally, we have to stress that the reciprocal
component is surely extremely important for the analysis and
understanding of navigation patterns and the network resilience to
link removal.
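
The clustering analysis of the reciprocal subgraph can be sketched as follows. This is a minimal illustration, assuming the graph is given as a list of (source, target) hyperlinks and that a reciprocal link exists whenever both directions are present; the function names are hypothetical. It evaluates $c_j = 2 n_{link} / (q_{r,j}(q_{r,j} - 1))$ for every vertex with at least two reciprocal neighbors and then averages it within each class of reciprocal degree.

```python
from collections import defaultdict
from itertools import combinations


def reciprocal_neighbors(edges):
    """Map each vertex to the set of vertices it shares a reciprocal link with."""
    edge_set = {(u, v) for u, v in edges if u != v}
    neighbors = defaultdict(set)
    for u, v in edge_set:
        if (v, u) in edge_set:
            neighbors[u].add(v)
    return dict(neighbors)


def reciprocal_clustering(edges):
    """Return {vertex: c_j} for vertices with reciprocal degree of at least 2."""
    neighbors = reciprocal_neighbors(edges)
    clustering = {}
    for j, nbrs in neighbors.items():
        q = len(nbrs)
        if q < 2:
            continue  # c_j is undefined for fewer than two neighbors
        # n_link: reciprocal links among the reciprocal neighbors of j.
        n_link = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
        clustering[j] = 2 * n_link / (q * (q - 1))
    return clustering


def clustering_by_degree(edges):
    """Return {q_r: average c_j over vertices with reciprocal degree q_r}."""
    neighbors = reciprocal_neighbors(edges)
    total = defaultdict(float)
    count = defaultdict(int)
    for j, c in reciprocal_clustering(edges).items():
        q = len(neighbors[j])
        total[q] += c
        count[q] += 1
    return {q: total[q] / count[q] for q in total}
```

On a toy graph made of one reciprocal triangle and one reciprocal star, the triangle vertices have clustering one and the hub of the star has clustering zero, which mirrors the two behaviors (cliques and star-like structures) discussed above.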



7. OUTLOOK 

Contrary to what happened with the scrutiny of Internet maps, 
the issue of sampling biases in the structure of the WWW has been 
left almost untouched. The large size of the data sets has led to the 
belief that the global properties were well defined in view of the 
abundant statistics available. Notably, from the present analysis,
it appears that the resulting picture of the WWW structure and
its statistical characterization can be considerably affected by the 
design of the tools we use to observe it. While some of the ba- 
sic properties are qualitatively preserved across different data sets, 
other features and quantities are highly variable. This results in a 
Figure 10: Average nearest-neighbor degree (top) and degree-dependent
clustering coefficient (bottom) for the reciprocal
links and for all the samples.



fuzzy picture of the WWW structure, where sampling biases still
play a major role. In other words, we are still in a position where 
it is impossible to have a definite conceptual framework to decode 
the structure of the global Web and how effectively we can navi- 
gate, search, index, or mine the Web. The present work thus high- 
lights the need for a theoretical framework able to approach a de- 
tailed analysis and understanding of the sampling biases implicit in 
the most widely used crawling strategies. In this sense, numerical 
studies of simulated exploration of directed network models could 
be a starting point to approach this problem and have a preliminary 
assessment of the intrinsic biases induced by the crawling process. 
Finally, the results presented in this paper are potentially helpful 
for improving the design of future crawlers, not only regarding la- 
tent biases. These applications are improved to a great extent when 
they take advantage of the special hyperlink structure among web 
documents and, in this respect, reciprocal links could play a key
role which has to be explored in more detail. 
reciprocal incoming links, the set $V_{in}(i)$, neighbors connected to
it by non-reciprocal outgoing links, the set $V_{out}(i)$, and neighbors
connected to it by reciprocal links, the set $V_r(i)$. The functions
given in Eq. 4 are valid whenever the in and out subscripts are restricted
to non-reciprocal links. When following only reciprocal
links, one can redefine them in a similar way:
and the normalization terms in this case are

APPENDIX

A. DEGREE-DEGREE CORRELATIONS: 
QUANTITATIVE DEFINITIONS 

We study the most significant two-point correlation functions,
taking into account that we can segregate the neighborhood of each
single node $i$ into neighboring nodes connected to it by incoming
links, the set $V_{in}(i)$, and neighboring nodes connected to it by outgoing
links, the set $V_{out}(i)$. Following Eq. 3, we can write
(4)

These measures are normalized by the corresponding uncorrelated
values, defined in Section 4.2 as the heterogeneity parameters
$\kappa_{in,out}$, $\kappa_{in}$, and $\kappa_{out}$, in order to make them independent of the
system size and so comparable across samples.

The same quantities can be calculated when non-reciprocal and 
reciprocal links are differentiated. Now, the neighborhood of each 
single node i is segregated into neighbors connected to it by non- 



